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Abstract 

Background: Although nearly half of the human genome is comprised of repetitive sequences, the expression 
profile of these elements remains largely uncharacterized. Recently developed high throughput sequencing 
technologies provide us with a powerful new set of tools to study repeat elements. Hence, we performed whole 
transcriptome sequencing to investigate the expression of repetitive elements in human frontal cortex using 
postmortem tissue obtained from the Stanley Medical Research Institute. 

Results: We found a significant amount of reads from the human frontal cortex originate from repeat elements. 
We also noticed that Alu elements were expressed at levels higher than expected by random or background 
transcription. In contrast, LI elements were expressed at lower than expected amounts. 

Conclusions: Repetitive elements are expressed abundantly in the human brain. This expression pattern appears to 
be element specific and can not be explained by random or background transcription. These results demonstrate 
that our knowledge about repetitive elements is far from complete. Further characterization is required to 
determine the mechanism, the control, and the effects of repeat element expression. 



Background 

Roughly half of the human genome is comprised of 
repetitive elements; these elements range from the 6Kb 
LINEl to micro and minisatellites [1-4]. The biological 
role of repetitive elements is not known and in general 
they are believed to be nonfunctional sequences. Some 
are remnants of ancient germline infections and trans- 
position events [3,5]. These include the human endo- 
genous retroviruses which at one time presumably 
existed as infectious exogenous agents [6]. Long and 
Short Interspersed Nuclear Elements (LINE/SINE) are 
active retrotransposons which continually insert them- 
selves into the human genome, though the vast majority 
of the elements have mutated beyond function [3,7,8]. 

Endogenous retroviruses and transposons are largely 
inactive either through DNA methylation or histone 
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modifications [9,10], or because they have accumulated 
mutations over time. However, active transposition by 
LINEs does occur in humans and occasionally causes dis- 
ease when the insertion occurs in an exon [8], splice site, 
or regulatory region [11]. Insertion of LI elements in the 
antisense orientation can also truncate transcripts by 
creating new polyadenylation sites (a phenomenon known 
as gene breaking)[12]. Repetitive elements alter gene 
expression by introducing promoter regions near genes, or 
through many other documented mechanisms [13-15]. It 
has been suggested that repetitive elements contribute to 
disease pathology by acting as superantigens or by causing 
an auto-immune response through molecular mimicry, 
but these hypotheses are largely speculative [16-18]. Sur- 
prisingly, certain endogenous retrovirus proteins have 
acquired beneficial functions in their hosts. For example, 
the HERV-W env gene, also known as human syncytin-1, 
contributes to trophoblast cell fusion during placental 
development [19]. Additionally, the expression of endo- 
genous retroviral and retrotranposon sequences is 
hypothesized to play a role in preventing infections by 
exogenous retroviral agents [6]. However, evidence for this 
theory is tenuous and the current dogma maintains that 
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repeat elements need to be silenced and selected against 
because they have adverse effects. 

Even if the expression of repeat elements is not harmful, 
repetitive elements have no known function. Because of 
this, microarray designs generally exclude repeats, both to 
conserve probes for known functional elements and to 
avoid having probes whose genomic position is necessarily 
unknown. Hence, the extent of repeat element expression 
remains largely unknown. However, there is evidence that 
certain repetitive elements are expressed at the RNA level 
in adult somatic cells and some are even translated into 
proteins [20,21]. 

High throughput sequencing (HTS) allows the entire 
transcriptome of a sample to be sequenced in an unbiased 
manner and is a good method for analyzing genome wide 
repetitive element expression. In a HTS experiment, gene 
expression is determined by counting the number of reads 
which are present from a particular transcript and normal- 
izing over the total number of reads observed. Two recent 
papers used this approach to assess the level of repeat ele- 
ments expression and found repetitive elements to be 
more widely expressed than previously thought [22,23]. 
One of the papers examined repeat expression in the cere- 
bellum, an area mainly involved in motor movements, and 
reported higher than expected levels of Alu element 
expression [22]. However, the authors concluded that 
expression of other repeat elements was due to "transcrip- 
tional background." Indeed, earlier studies suggest that a 
huge proportion of the human genome is transcribed [24] 
and transcripts of repetitive elements could be derived 
from such "promiscuous transcription." We performed 
deep transcriptome sequencing on the frontal cortex of 
ten different human postmortem brains and aligned the 
reads from these samples to all repetitive regions in the 
human genome. The frontal cortex is involved in complex 
cognitive functions and abnormalities in this cortical area 
are associated with various psychiatric illnesses. Consistent 
with earlier studies, we found that a large proportion of all 
aligned reads mapped to repeat elements. Unlike previous 
studies which sequenced pooled samples, we sequenced 
our samples independently and found that repeat expres- 
sion patterns are stable between individuals. We also 
found that Alu element transcripts are expressed at dis- 
proportionately high levels whereas truncated LINE ele- 
ments are expressed at levels much lower than expected. 
Hence we conclude that the expression of repeat elements 
is not a side effect of promiscuous, random transcription. 

Results 

Repetitive Elements are expressed in the frontal cortex 

In order to determine the expression level of repetitive 
elements in cortical tissue, mRNA was extracted from 10 
post mortem frontal cortex samples, converted to cDNA 
and paired end HTS libraries were constructed. Initially, 



four lanes of paired end sequencing was performed on 
three samples (RUNl and RUN2 Additional File 1: Table 
SI) on an lUumina Genome Analyzer. In order to 
increase our sample size, we sequenced an additional 
seven samples (RUNS Additional File 1: Table SI) on an 
Illumina HiSeq 2000. We aligned the sequencing data 
from these runs to a human repeats database that we 
constructed. The database contains the reference genome 
(NCBI36/hgl8 build) sequence of every annotated repeat 
region in repBase with a 50 bp flank on each side of the 
repeat. In this way, we could be reasonably certain that a 
short read that did not align to our database was not 
derived from a repetitive sequence. The flanking 
sequences were included so that a short read that 
spanned the junction of a repeat element with unique 
genomic DNA would not be missed. We avoided simply 
aligning the sequencing reads to the consensus sequences 
for the repeat classes since transcripts originating from 
older or slightly divergent members of a class of repeti- 
tive elements may be missed using this approach, and 
many of the consensus sequences contain ambiguity 
characters. Using this custom database has several advan- 
tages over simply aligning the reads to the genome 
assembly, and then selecting repeats that intersect with 
the repeats annotations. First, our method allows us to 
include repeats annotated on the "random" chromo- 
somes. Second, we are able to capture reads that other- 
wise might align to the part of the assembly that is 
repeat-like, but was not annotated as a repeat in RepBase. 
When we map directly onto our repeats database, such 
reads will align to the closest member of the appropriate 
repeat family and won't be excluded from the analysis. 

We found a large proportion of mappable reads (an 
average of 8%) in brain mRNA to originate from repeat 
sequences. These results are consistent with the findings 
from a previous report which used the CAGE tagging 
technology to map repeat elements [23]. We also calcu- 
lated the number of reads which aligned uniquely to 
repeat masked intronic and intergenic regions. In our 
study, the number of reads from intronic and intergenic 
regions (Additional File 1: Table SI) is much lower than 
previous estimates for brain expression from these geno- 
mic compartments [22]. In addition, the samples 
sequenced in RUN3 were treated with DNase prior to 
library construction. Hence, DNA contamination is unli- 
kely to account for our findings. We hypothesized that if 
repeat elements are present in such abundant amounts in 
the cortex, then we should be able to detect these ele- 
ments using Northern blot analysis, a technique which 
requires high to moderate quantities of the target RNA. 
We used the human endogenous retrovirus-W (HERV-W) 
gag sequence as a probe and performed Northern blot 
analysis on RNA blots containing material from various 
different human tissues (Figure 1). Several bands are 
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Figure 1 Detection of HERV-W gag transcripts by Northern blot analysis. The Northern blot demonstrates that some repetitive elements 
are highly expressed in brain and other tissues. This is consistent with our whole transcriptome sequencing data where we found a large 
number of reads mapping to repeat element sequences. 



detected in each lane of this blot, consistent with the pre- 
sence of many HERV-W gag sequences of varying lengths 
in the genome. These blots confirm that HERV-W gag 
transcripts are abundantly expressed not only in the brain 
but also in several other tissue types including pancreas 
and the heart. Furthermore, the differential banding 



pattern and intensity of signal between lanes suggests that 
the expression of these elements is tissue specific. 

Repetitive Element types 

To further characterize repetitive element expression, we 
wanted to determine which types of elements are 
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expressed. Our repeats database contains 141, 145, 063 
bp, or about 45% of the genome, as expected. As this 
database is by definition internally repetitive, we expect 
that most short reads that align to the database will align 
to more than one sequence. Instead of attempting to 
recover ail possible alignments for each short read, we 
chose to keep only one alignment for any read that 
matched to the database, so that the final counts accu- 
mulated along a "consensus" element for each repeat 
type, though the reads could align to many instances of 
that repeat in the genome. This strategy prevented us 
from determining whether any particular genomic 
regions are more active than others, but enabled us to 
more confidently determine the expression level of each 
class of repeats, taken as a group. When we count the 
reads using this approach, we find a significant amount 
of Alu, L1/L2, and to a lesser extent, endogenous retro- 
virus expression in all of the frontal cortex samples 
(Figure 2). In addition, we downloaded existing RNA 
sequencing data performed on the human brain from the 
short reads archive (GSM475204-GSM475209). The dis- 
tribution of repeat element transcripts in this data, called 
MAQC, (Additional File 1: Table SI) is very similar to 
our samples (Figure 2). If repetitive regions were tran- 
scribed simply at random, the distribution of short reads 
mapping to these regions should roughly mirror the per- 
centage of the genome occupied by each repeat class. To 
determine whether the observed read counts correlated 
with the number of reads that would be generated at ran- 
dom (and to correct for possible sequence composition- 
related problems within the data), we created a set of 
simulated reads from the database of repeat sequences by 
taking 76 bp subsequences (the size of the original reads) 
at random from all regions in our repeat database. In this 



way, we created 100 datasets that contained approxi- 
mately the same number of repeat-derived reads as what 
we found from the first two runs of our sequencing data 
(-400 K short reads). Alignment of each of these 100 sets 
of simulated short reads to the database of repeats gives 
the expected outcome if each repeat class in the genome 
were transcribed at random. These simulations are repre- 
sented in green bars in Figure 2. According to this analy- 
sis, Alu/Sine repeats are expressed at higher levels than 
expected whereas LI transcripts are underrepresented. 
We repeated this analysis for the CCDS database, which 
contains only expressed sequences, to account for the 
possibility that the pattern of repetitive expression is dri- 
ven by repeat elements which are present in coding 
sequences. For this simulation experiment we sampled 7, 
881, 000 reads - an average number of reads produced by 
the first two sequencing runs for the frontal cortex sam- 
ples. The result of this analysis is shown in purple bars in 
Figure 2. The repetitive sequence derived from uniform 
expression of CCDS annotations can not account for the 
observed repetitive element expression pattern. However, 
we know that the sequences in coding regions are not 
expressed equally and it is quite possible that the high 
levels of Alu expression may be driven by a single or sev- 
eral Alu sequences which are associated with genes that 
are expressed in high levels. Since one Alu-derived read 
can not be mapped specifically to any genomic Alu 
sequence, this possibility can not be ruled out. 

To determine if the differences we found between our 
simulations, which assumed that repeat element tran- 
scription occurred simply at random, and the sequencing 
data are statistically significant, we used a negative bino- 
mial test designed to test for differential expression in 
the RNA Seq data [25]. Based on this test, the difference 
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Figure 2 Distribution of reads by repeat family in the frontal cortex. Distribution of the reads aligning to the repeats by family observed in 
the frontal cortex samples and in simulated reads drawn from the database of coding sequences (random CCDS) and from the repetitive 
sequence (random repeats). Height of the bars corresponds to average frequencies of the reads aligned to a particular family, the error bars 
indicate plus/minus standard deviation. Averages and standard deviations for samples are derived from pooling the samples, and for simulated 
data over 100 simulation runs. 
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between the expected (simulated random expression) and 
observed transcription pattern for repeat elements is sta- 
tistically significant (Table 1). 

LINE-1 elements are unique in that when they trans- 
pose, the new copy often ends up severely 5' truncated. 
Only full-length LI elements are active, so the vast major- 
ity of the copies of LI in the genome are inactive. Upon a 
careful examination of the 5' end of LI, we observed a 
clear increase in the number of reads observed from the 
first few bases of the element, over the simulated data 
(Additional File 2: figure SI). This indicates that these full- 
length elements (the only ones capable of retrotransposi- 
tion) may be expressed at reasonably high levels, though 
the bulk of the LI sequences are underexpressed and the 
5' end of LI is the least abundant segment of this element 
in the genome. Because full length LI elements carry their 
own promoter, it is possible that transcripts from these 
elements are initiated independently from within the LI, 
not from neighboring genes. 

Discussion 

Our results indicate that repetitive elements in the gen- 
ome are abundantly transcribed in human cortical tis- 
sue. A large number of the reads from the sequencing 
runs aligned to the repeats database we constructed. 
These results are consistent with the finding of Faulkner 
et al. [23] but they are much lower than the estimates of 
Xu et al [22] who report 40% of all transcripts are 
derived from repetitive sequences. We can not compare 
our results with Xu et al since they used total RNA as 
their starting material whereas we used poly-A purified 
RNA to construct our HTS-libraries. In addition, our 
study examined cortical tissue which has significantly 
different gene expression and DNA methylation profiles 
than the cerebellum [26]. Also, the Northern blot 



analysis we performed independently validates our find- 
ings and assures us that our observations are not due to 
technical issues related to DNA contamination or pecu- 
liarities in the alignment software. Finally, the sequen- 
cing data in this paper was generated on two different 
sequencing platforms, using two separate library pre- 
paration techniques by two independent laboratories. 
The analysis of our sequencing runs along with the 
MAQC data set yield similar results and demonstrate 
the robustness of our findings. 

The Northern blot analysis we performed suggests that 
HERV-W gag expression is regulated in a tissue specific 
manner and these results are consistent with the findings 
of Nellaker et al [27]. Since the different tissues from the 
Northern blots did not originate from the same sample, 
we may be detecting inter individual variation for HERV- 
W gag expression, which in itself would suggest that this 
expression is regulated. It is possible, though, that tissue- 
specific epigenetic differences can drive or repress repeat 
expression, and this phenomenon can be explored in 
future studies. 

The biological process behind expressing repeat ele- 
ments at such high levels remains unclear, though random 
transcription does not explain the expression patterns 
seen and another mechanism must be involved. The 
majority of these elements do not have open reading 
frames (due to mutations) and most likely do not encode 
proteins, though the full length LI elements are possibly 
functional. RNA editing of Alu sequences has been 
described [28]; the presence of this mechanism suggests 
that Alu elements are expressed and then suppressed. 
Other elements have less clear roles; in fact, LINE-1 retro- 
transposition activity has not been correlated with its 
expression. Suggesting a possible function for the high 
repeat expression levels at this point would be purely 



Table 1 Significance Tests for Observed Repeat Element Transcription 



Repeat Family 


Fold Change 


p-value 


Adjusted for multiple comparisons p-value 


Alu_SINE 


0.8162648 


O.OOE+00 


O.OOE+00 


ERV 


1 .626032 


O.OOE+00 


O.OOE+00 


TcMar 


1.1997436 


3.18E-92 


4.23E-92 


hAT 


0.8364789 


2.42E-140 


3.63E-140 


DNA 


0.4241692 


8.22E-50 


9.86E-50 


LI 


2.1400934 


O.OOE+00 


O.OOE+00 


L2 


0.769354 


O.OOE+00 


O.OOE+00 


LTR 


1.6255619 


2.05E-06 


2.05E-06 


CR1 


0.6404078 


5.19E-201 


1 .04E-200 


Satellite 


2.8788933 


2.78E-167 


477E-167 


IVIIR 


0.4463942 


O.OOE+00 


O.OOE+00 



Other 1 .14991 57 3.49E-35 3.81 E-35 



We compared our sequencing data to our simulated reads {simulated reads represent what is expected if repeat element transcription was merely at random) 
and tested for differential expression. The data was modeled using a negative binomial distribution. The fold changes were derived based on this model. 
Benjamini-Hochberg method was used to correct for multiple testing. P-values adjusted for multiple testing are shown in the last column. 
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speculative; here we attempt only to establish a baseline 
for further analyses. 

Despite making up nearly half the human genome, very 
little is known about the expression patterns of repetitive 
elements. In this article we report that repetitive elements 
make up a significant proportion of the cortical transcrip- 
tome. In addition to characterizing the tissue-specific 
expression profile, the inter-individual transcription varia- 
tion needs to be determined for these elements. Our data 
indicate inter-individual variation in the expression of 
repeat element families whereas the total number of repeat 
elements in the transcriptome of each individual appears 
to be fairly consistent. Establishing the normal expression 
state of repeat elements would be particularly useful for 
determining if alterations to this expression pattern can 
cause (or indicate) disease. The presence of the RNA edit- 
ing machinery which appears to target Alu sequences, 
along with repetitive element expression that is not consis- 
tent with a simple "background expression" model sug- 
gests that our knowledge of these not-so-silent elements is 
still incomplete. 

Conclusions 

In this article, we report the abundant expression of repeat 
elements in the human cortex. Remarkably, random back- 
ground expression can not account for the observed lower 
than expected LINE-1 and higher than expected Alu ele- 
ment transcription. In addition, our data suggests that full 
length LINE-1 elements may be expressed at higher levels 
than truncated copies. Further characterization is required 
to determine the basis and the consequences of repeat ele- 
ment expression. 

Methods 

Samples 

All postmortem brain tissue samples were kindly donated 
by the Stanley Medical Research Institute. Frontal cortex 
tissue was dissected from the medial frontal gyrus and 
total RNA was purified using the RNeasy kit (QIAGEN, 
CA). The quality of each sample was verified by running 
an aliquot on a Bioanalyzer (Agilent Technologies, CA). 

Library preparation and sequencing 

Three sequencing runs were performed for this paper. 
RUNl and RUN2 were performed at Cold Spring Harbor 
Laboratory and RUNS was performed at Johns Hopkins 
University. For the 3 samples in RUNl and RUN2 10 \ig 
of TRIzol-extracted total RNA was cleaned up using the 
RNeasy Mini Kit (QIAGEN, CA) prior to cDNA cloning. 
To isolate the mRNA fraction from each sample, the total 
RNA was treated to two consecutive rounds of denatura- 
tion at 65°C and binding to Dynabeads 01igo(dT)25 (Invi- 
trogen, CA). The purified mRNA samples were then 
fragmented using Fragmentation Buffer (Ambion, CA) 



and recovered by ethanol precipitation with glycogen as a 
carrier. For first strand cDNA synthesis, the mRNA sam- 
ples were incubated with Random Hexamer Primers (Invi- 
trogen, CA), First Strand Buffer (Invitrogen, CA), 
RNaseOUT (Invitrogen, CA), DTT, dNTP mbc, and Super- 
Script II (Invitrogen, CA) in a thermal cycler using the fol- 
lowing program: 25°C for 10 min, 42°C for 50 min, 70°C 
for 15 min, and 4°C hold. For second strand cDNA synth- 
esis, the first strand samples were mixed with second 
strand buffer (50 mM Tris-HCl, pH 7.8; 5 mM MgClj; 
and 1 mM DTT), dNTP mix, RNase H (Invitrogen, CA), 
and DNA Polymerase I (Invitrogen, CA), followed with 
incubation at 16°C for 2.5 h. The cDNA samples were pur- 
ified using the QIAquick PGR Purification Kit (QIAGEN, 
CA). Following purification, the samples were converted 
into cDNA libraries using a series of steps derived from 
the Paired-End Sample Prep Kit (Illumina, CA). The modi- 
fications to the standard library preparation procedure are 
as follows: (1) after the addition of 3'-A overhangs, the 
samples were purified using the MinElute PGR Purifica- 
tion Kit (QIAGEN) and the entire eluates were used for 
the subsequent Illumina adapter ligation step, (2) the 
selected library fragment size was 200 bp, (3) the gel 
extraction step was carried out using the MinElute Gel 
Extraction Kit (QIAGEN, CA) and 20 ^iL of Buffer EB for 
elution, and (4) the entire gel extraction eluates were used 
for PGR enrichment of the libraries using a sixteen-cycle 
program. The completed cDNA libraries were individually 
sequenced on an Illumina Genome Analyzer producing 
100 base reads from each end. For the 7 samples in RUN3, 
5 \ig of RNA was treated with Turbo DNase (Ambion, 
CA) and paired end sequencing libraries were generated 
using the TruSeq RNA Sample prep kit following the 
manufacturer's recommended protocol. These libraries 
were sequenced on a HiSeq 2000 producing 100 base 
reads from each end. 

Northern blots 

Premade blots containing 1 \ig of mRNA derived from 
adult human tissues were purchased from Biochain 
(Hayward, CA). RNA probes were generated using the 
Riboprobe kit (Promega, WI) and P-32 labeled cytosines. 
The hybridization and washes on the blot were carried 
out using the Northern Max kit (Ambion, CA). 

Reads mapping and simulations 

We treated beginnings and ends of the paired-end reads as 
technical replicates, and aligned them separately. As we 
accept any alignment of a read to our database of repeats, 
splitting the paired ends does not compromise our ability 
to correctly assign reads to repeat families. All reads for 
the frontal cortex samples were trimmed to 76 base 
length. We aligned the reads to reference subgenomes 
(RepBase, intronic, and intergenic regions) using Bowtie 
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software http://bowtie-bio.sourceforge.net/index.shtml[29] 
using parameters that allow zero mismatches in the first 
20 bases of the read, and retain the best alignment for the 
read. The same parameters were used to align both actual 
and simulated reads. For alignments to introns and inter- 
genic regions we only accepted reads that aligned exactly 
once in the entire genome. Genomic coordinates used to 
build reference subgenomes were determined using UCSC 
Genome Browser [30] and sequences from hgl8 build 
were fetched using Galaxy (http://g2.bx.psu.edu/)[31]. 

In simulations we aimed to reproduce the scenario of 
random (uniform) transcription; while our simulated 
distributions closely mirror the genomic profiles of the 
sequences being evaluated, the simulations are critical to 
establishing statistical significance, as they control for 
any biases introduced by the alignment process. We 
extracted genomic sequences for RepBase annotated 
repeats, generated random start sites for the reads and 
extracted 76 base pairs for our simulated reads. Sam- 
pling from CCDS was done similarly, using consensus 
coding sequences from NCBI. We generated 100 data- 
sets, each comprising the appropriate number of reads 
(see Results) and read length. 

For statistical testing we pooled the data from the 
technical replicates of the same biological sample (we 
considered beginnings and ends of the paired ends to be 
technical replicates as well). We then compared read 
counts from these 10 samples mapping to various repeat 
families with read counts from our simulated data. We 
employed a statistical test proposed by Anders and 
Huber [25]. This test for differential expression in RNA- 
Seq data models the read counts mapping to a given 
repeat family using a negative binomial distribution with 
the parameters estimated from the observed data. We 
used R/Bioconductor package "DESeq" implementation 
of the test. This R/Bioconductor package performs mul- 
tiple testing correction using Benjamini-Hochberg 
method [32]. 

Additional material 



representative histogram of the distribution of these average starting 
points for one of the samples and a single draw of simulated reads. 
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